ddpg algorithm
Optimizing Drivers' Discount Order Acceptance Strategies: A Policy-Improved Deep Deterministic Policy Gradient Framework
Dai, Hanwen, Gao, Chang, He, Fang, Ji, Congyuan, Yang, Yanni
The rapid expansion of platform integration has emerged as an effective solution to mitigate market fragmentation by consolidating multiple ride-hailing platforms into a single application. To address heterogeneous passenger preferences, third-party integrators provide Discount Express service delivered by express drivers at lower trip fares. For the individual platform, encouraging broader participation of drivers in Discount Express services has the potential to expand the accessible demand pool and improve matching efficiency, but often at the cost of reduced profit margins. This study aims to dynamically manage drivers' acceptance of Discount Express from the perspective of an individual platform. The lack of historical data under the new business model necessitates online learning. However, early-stage exploration through trial and error can be costly in practice, highlighting the need for reliable early-stage performance in real-world deployment. To address these challenges, this study formulates the decision regarding the proportion of drivers accepting discount orders as a continuous control task. In response to the high stochasticity and the opaque matching mechanisms employed by third-party integrator, we propose an innovative policy-improved deep deterministic policy gradient (pi-DDPG) framework. The proposed framework incorporates a refiner module to boost policy performance during the early training phase. A customized simulator based on a real-world dataset is developed to validate the effectiveness of the proposed pi-DDPG. Numerical experiments demonstrate that pi-DDPG achieves superior learning efficiency and significantly reduces early-stage training losses, enhancing its applicability to practical ride-hailing scenarios.
Deep reinforcement learning for optimal trading with partial information
Macrì, Andrea, Jaimungal, Sebastian, Lillo, Fabrizio
Reinforcement Learning (RL) applied to financial problems has been the subject of a lively area of research. The use of RL for optimal trading strategies that exploit latent information in the market is, to the best of our knowledge, not widely tackled. In this paper we study an optimal trading problem, where a trading signal follows an Ornstein-Uhlenbeck process with regime-switching dynamics. We employ a blend of RL and Recurrent Neural Networks (RNN) in order to make the most at extracting underlying information from the trading signal with latent parameters. The latent parameters driving mean reversion, speed, and volatility are filtered from observations of the signal, and trading strategies are derived via RL. To address this problem, we propose three Deep Deterministic Policy Gradient (DDPG)-based algorithms that integrate Gated Recurrent Unit (GRU) networks to capture temporal dependencies in the signal. The first, a one -step approach (hid-DDPG), directly encodes hidden states from the GRU into the RL trader. The second and third are two-step methods: one (prob-DDPG) makes use of posterior regime probability estimates, while the other (reg-DDPG) relies on forecasts of the next signal value. Through extensive simulations with increasingly complex Markovian regime dynamics for the trading signal's parameters, as well as an empirical application to equity pair trading, we find that prob-DDPG achieves superior cumulative rewards and exhibits more interpretable strategies. By contrast, reg-DDPG provides limited benefits, while hid-DDPG offers intermediate performance with less interpretable strategies. Our results show that the quality and structure of the information supplied to the agent are crucial: embedding probabilistic insights into latent regimes substantially improves both profitability and robustness of reinforcement learning-based trading strategies.
Hierarchical Deep Deterministic Policy Gradient for Autonomous Maze Navigation of Mobile Robots
Hu, Wenjie, Zhou, Ye, Ho, Hann Woei
Maze navigation is a fundamental challenge in robotics, requiring agents to traverse complex environments efficiently. While the Deep Deterministic Policy Gradient (DDPG) algorithm excels in control tasks, its performance in maze navigation suffers from sparse rewards, inefficient exploration, and long-horizon planning difficulties, often leading to low success rates and average rewards, sometimes even failing to achieve effective navigation. To address these limitations, this paper proposes an efficient Hierarchical DDPG (HDDPG) algorithm, which includes high-level and low-level policies. The high-level policy employs an advanced DDPG framework to generate intermediate subgoals from a long-term perspective and on a higher temporal scale. The low-level policy, also powered by the improved DDPG algorithm, generates primitive actions by observing current states and following the subgoal assigned by the high-level policy. The proposed method enhances stability with off-policy correction, refining subgoal assignments by relabeling historical experiences. Additionally, adaptive parameter space noise is utilized to improve exploration, and a reshaped intrinsic-extrinsic reward function is employed to boost learning efficiency. Further optimizations, including gradient clipping and Xavier initialization, are employed to improve robustness. The proposed algorithm is rigorously evaluated through numerical simulation experiments executed using the Robot Operating System (ROS) and Gazebo. Regarding the three distinct final targets in autonomous maze navigation tasks, HDDPG significantly overcomes the limitations of standard DDPG and its variants, improving the success rate by at least 56.59% and boosting the average reward by a minimum of 519.03 compared to baseline algorithms. Keywords: Hierarchical Deep Reinforcement Learning; Deep Deterministic Policy Gradient; Off-policy Correction; Adaptive Parameter Space Noise; Reshaping Reward Function; Autonomous Maze Navigation1. Introduction In recent years, autonomous navigation for mobile robots has emerged as a pivotal research focus, driven by its potential to enhance efficiency and safety across diverse applications, which mainly strive to enable mobile robots to reach the target point while effectively avoiding obstacles [1, 2]. Maze navigation, as one of the diverse autonomous navigation tasks, poses distinct challenges. This problem is particularly relevant in critical scenarios such as warehouse logistics, search and rescue operations in disaster-stricken areas, and autonomous exploration of hazardous or unknown environments, where precise and adaptive path planning is crucial for success.
Broad Critic Deep Actor Reinforcement Learning for Continuous Control
Thalagala, Shiron, Wong, Pak Kin, Wang, Xiaozheng
In the domain of continuous control, deep reinforcement learning (DRL) demonstrates promising results. However, the dependence of DRL on deep neural networks (DNNs) results in the demand for extensive data and increased computational complexity. To address this issue, a novel hybrid architecture for actor-critic reinforcement learning (RL) algorithms is introduced. The proposed architecture integrates the broad learning system (BLS) with DNN, aiming to merge the strengths of both distinct architectural paradigms. Specifically, the critic network is implemented using BLS, while the actor network is constructed with a DNN. For the estimations of the critic network parameters, ridge regression is employed, and the parameters of the actor network are optimized through gradient descent. The effectiveness of the proposed algorithm is evaluated by applying it to two classic continuous control tasks, and its performance is compared with the widely recognized deep deterministic policy gradient (DDPG) algorithm. Numerical results show that the proposed algorithm is superior to the DDPG algorithm in terms of computational efficiency, along with an accelerated learning trajectory. Application of the proposed algorithm in other actor-critic RL algorithms is suggested for investigation in future studies.
Trajectory Tracking Using Frenet Coordinates with Deep Deterministic Policy Gradient
Jiang, Tongzhou, Liu, Lipeng, Jiang, Junyue, Zheng, Tianyao, Jin, Yuhui, Xu, Kunpeng
This paper studies the application of the DDPG algorithm in trajectory-tracking tasks and proposes a trajectorytracking control method combined with Frenet coordinate system. By converting the vehicle's position and velocity information from the Cartesian coordinate system to Frenet coordinate system, this method can more accurately describe the vehicle's deviation and travel distance relative to the center line of the road. The DDPG algorithm adopts the Actor-Critic framework, uses deep neural networks for strategy and value evaluation, and combines the experience replay mechanism and target network to improve the algorithm's stability and data utilization efficiency. Experimental results show that the DDPG algorithm based on Frenet coordinate system performs well in trajectory-tracking tasks in complex environments, achieves high-precision and stable path tracking, and demonstrates its application potential in autonomous driving and intelligent transportation systems. Keywords- DDPG; path tracking; robot navigation
Deep Reinforcement Learning for Online Optimal Execution Strategies
Micheli, Alessandro, Monod, Mélodie
This paper tackles the challenge of learning non-Markovian optimal execution strategies in dynamic financial markets. We introduce a novel actor-critic algorithm based on Deep Deterministic Policy Gradient (DDPG) to address this issue, with a focus on transient price impact modeled by a general decay kernel. Through numerical experiments with various decay kernels, we show that our algorithm successfully approximates the optimal execution strategy. Additionally, the proposed algorithm demonstrates adaptability to evolving market conditions, where parameters fluctuate over time. Our findings also show that modern reinforcement learning algorithms can provide a solution that reduces the need for frequent and inefficient human intervention in optimal execution tasks.
Learning Agents With Prioritization and Parameter Noise in Continuous State and Action Space
Mangannavar, Rajesh, Srinivasaraghavan, Gopalakrishnan
Among the many variants of RL, an important class of problems is where the state and action spaces are continuous -- autonomous robots, autonomous vehicles, optimal control are all examples of such problems that can lend themselves naturally to reinforcement based algorithms, and have continuous state and action spaces. In this paper, we introduce a prioritized form of a combination of state-of-the-art approaches such as Deep Q-learning (DQN) and Deep Deterministic Policy Gradient (DDPG) to outperform the earlier results for continuous state and action space problems. Our experiments also involve the use of parameter noise during training resulting in more robust deep RL models outperforming the earlier results significantly. We believe these results are a valuable addition for continuous state and action space problems.
Autonomous Navigation of Unmanned Vehicle Through Deep Reinforcement Learning
Xu, Letian, Liu, Jiabei, Zhao, Haopeng, Zheng, Tianyao, Jiang, Tongzhou, Liu, Lipeng
This paper explores the method of achieving autonomous navigation of unmanned vehicles through Deep Reinforcement Learning (DRL). The focus is on using the Deep Deterministic Policy Gradient (DDPG) algorithm to address issues in high-dimensional continuous action spaces. The paper details the model of a Ackermann robot and the structure and application of the DDPG algorithm. Experiments were conducted in a simulation environment to verify the feasibility of the improved algorithm. The results demonstrate that the DDPG algorithm outperforms traditional Deep Q-Network (DQN) and Double Deep Q-Network (DDQN) algorithms in path planning tasks.
Reconfigurable Intelligent Surface Aided Vehicular Edge Computing: Joint Phase-shift Optimization and Multi-User Power Allocation
Qi, Kangwei, Wu, Qiong, Fan, Pingyi, Cheng, Nan, Chen, Wen, Letaief, Khaled B.
Vehicular edge computing (VEC) is an emerging technology with significant potential in the field of internet of vehicles (IoV), enabling vehicles to perform intensive computational tasks locally or offload them to nearby edge devices. However, the quality of communication links may be severely deteriorated due to obstacles such as buildings, impeding the offloading process. To address this challenge, we introduce the use of Reconfigurable Intelligent Surfaces (RIS), which provide alternative communication pathways to assist vehicular communication. By dynamically adjusting the phase-shift of the RIS, the performance of VEC systems can be substantially improved. In this work, we consider a RIS-assisted VEC system, and design an optimal scheme for local execution power, offloading power, and RIS phase-shift, where random task arrivals and channel variations are taken into account. To address the scheme, we propose an innovative deep reinforcement learning (DRL) framework that combines the Deep Deterministic Policy Gradient (DDPG) algorithm for optimizing RIS phase-shift coefficients and the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm for optimizing the power allocation of vehicle user (VU). Simulation results show that our proposed scheme outperforms the traditional centralized DDPG, Twin Delayed Deep Deterministic Policy Gradient (TD3) and some typical stochastic schemes.
Digital Twin-Native AI-Driven Service Architecture for Industrial Networks
Duran, Kubra, Broadbent, Matthew, Yurdakul, Gokhan, Canberk, Berk
The dramatic increase in the connectivity demand results in an excessive amount of Internet of Things (IoT) sensors. To meet the management needs of these large-scale networks, such as accurate monitoring and learning capabilities, Digital Twin (DT) is the key enabler. However, current attempts regarding DT implementations remain insufficient due to the perpetual connectivity requirements of IoT networks. Furthermore, the sensor data streaming in IoT networks cause higher processing time than traditional methods. In addition to these, the current intelligent mechanisms cannot perform well due to the spatiotemporal changes in the implemented IoT network scenario. To handle these challenges, we propose a DT-native AI-driven service architecture in support of the concept of IoT networks. Within the proposed DT-native architecture, we implement a TCP-based data flow pipeline and a Reinforcement Learning (RL)-based learner model. We apply the proposed architecture to one of the broad concepts of IoT networks, the Internet of Vehicles (IoV). We measure the efficiency of our proposed architecture and note ~30% processing time-saving thanks to the TCP-based data flow pipeline. Moreover, we test the performance of the learner model by applying several learning rate combinations for actor and critic networks and highlight the most successive model.